30 research outputs found

    Analyzing The Community Structure Of Web-like Networks: Models And Algorithms

    This dissertation investigates the community structure of web-like networks (i.e., large, random, real-life networks such as the World Wide Web and the Internet). Recently, it has been shown that many such networks have a locally dense and globally sparse structure, with certain small, dense subgraphs occurring much more frequently than they do in classical Erdős-Rényi random graphs. This peculiarity, commonly referred to as community structure, has been observed in seemingly unrelated networks such as the Web, email networks, citation networks, biological networks, etc. The pervasiveness of this phenomenon has led many researchers to believe that such cohesive groups of nodes might represent meaningful entities. For example, in the Web such tightly knit groups of nodes might represent pages with a common topic, geographical location, etc., while in neural networks they might represent evolved computational units. The notion of community has emerged in an effort to formalize the empirical observation of the locally dense, globally sparse structure of web-like networks. In the broadest sense, a community in a web-like network is defined as a group of nodes that induces a dense subgraph which is sparsely linked with the rest of the network. Owing to a wide array of envisioned applications, ranging from crawlers and search engines to network security and network compression, there has recently been widespread interest in finding efficient community-mining algorithms. In this dissertation, the community structure of web-like networks is investigated by a combination of analytical and computational techniques. First, we consider the problem of modeling web-like networks. In recent years, many new random graph models have been proposed to account for some recently discovered properties of web-like networks that distinguish them from the classical random graphs.
The vast majority of these random graph models take into account only the addition of new nodes and edges. Yet several empirical observations indicate that deletion of nodes and edges occurs frequently in web-like networks. Inspired by such observations, we propose and analyze two dynamic random graph models that combine node and edge addition with uniform and preferential deletion of nodes, respectively. In both cases, we find that the random graphs generated by such models follow power-law degree distributions (in agreement with the degree distribution of many web-like networks). Second, we analyze the expected density of certain small subgraphs, such as defensive alliances on three and four nodes, in various random graph models. Our findings show that while in the binomial random graph the expected density of such subgraphs is very close to zero, in some dynamic random graph models it is much larger. These findings agree with our results obtained by computing the number of communities in some Web crawls. Next, we investigate the computational complexity of the community-mining problem under various definitions of community. Assuming the definition of community as a global defensive alliance or a global offensive alliance, we prove, using transformations from the dominating set problem, that finding optimal communities is an NP-complete problem. These and other similar complexity results, coupled with the fact that many web-like networks are huge, indicate that fast, exact sequential algorithms for mining communities are unlikely to be found. To handle this difficulty, we adopt an algorithmic definition of community and a simpler version of the community-mining problem, namely: find the largest community to which a given set of seed nodes belongs.
We propose several greedy algorithms for this problem. The first proposed algorithm starts out with a set of seed nodes (the initial community) and then repeatedly selects some nodes from the community's neighborhood and pulls them into the community. In each step, the algorithm uses the clustering coefficient, a parameter that measures the fraction of pairs of neighbors of a node that are themselves adjacent, to decide which nodes from the neighborhood should be pulled into the community. The time complexity of this algorithm is bounded in terms of the number of nodes visited by the algorithm and the maximum degree encountered. Thus, assuming a power-law degree distribution, this algorithm is expected to run in near-linear time. The proposed algorithm achieved good accuracy when tested on some real and computer-generated networks: the fraction of community nodes classified correctly is generally above 80% and often above 90%. A second algorithm is also proposed, based on a generalized clustering coefficient that takes into account not only the first neighborhood but also the second, the third, etc. This algorithm achieves better accuracy than the first one but also runs slower. Finally, a randomized version of the second algorithm is proposed, which improves the time complexity without significantly affecting the accuracy. The main target application of the proposed algorithms is focused crawling, the selective search for web pages that are relevant to a pre-defined topic.
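
The seed-expansion idea described in this abstract can be illustrated with a minimal Python sketch. It is not the dissertation's algorithm: the admission rule (accept a neighboring node when its clustering coefficient reaches `threshold`), the names `grow_community`, `threshold`, and `max_rounds`, and the toy graph are all assumptions made for illustration.

```python
from itertools import combinations


def clustering_coefficient(adj, v):
    """Fraction of pairs of neighbors of v that are themselves adjacent."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def grow_community(adj, seeds, threshold=0.5, max_rounds=10):
    """Greedy expansion from a seed set (hypothetical admission rule).

    In each round, every node adjacent to the current community is
    admitted if its clustering coefficient is at least `threshold`.
    """
    community = set(seeds)
    for _ in range(max_rounds):
        frontier = {u for v in community for u in adj[v]} - community
        admitted = {u for u in frontier
                    if clustering_coefficient(adj, u) >= threshold}
        if not admitted:
            break  # no neighbor qualifies, so the community is stable
        community |= admitted
    return community


# Toy graph (assumed): a 4-clique {0, 1, 2, 3} with a path 3-4-5-6 attached.
adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3, 5},
    5: {4, 6},
    6: {5},
}
print(sorted(grow_community(adj, {0})))  # -> [0, 1, 2, 3]
```

In this toy graph the expansion stops at node 4, whose two neighbors are not adjacent to each other, so the dense clique is separated from the sparse path.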

    Evaluation of a Graph-based Topical Crawler

    Topical (or focused) crawlers have become important tools in dealing with the massiveness and dynamic nature of the World Wide Web. Guided by a data mining component that monitors and analyzes the boundary of the set of crawled pages, a focused crawler selectively seeks out pages on a pre-defined topic. Recent research indicates that both the textual content of web pages and the structural information enclosed in the Web graph need to be exploited in order to build high-quality focused crawlers. While a variety of text-based and graph-based measures of similarity that can direct a focused crawler toward relevant pages have been developed, much remains to be done toward formally evaluating and ranking the effectiveness of various focused crawling algorithms. Inspired by a recent and comprehensive evaluation framework for focused crawlers, we analyze the performance of a graph-based algorithm and compare it with two other algorithms: a breadth-first one and a text-based, best-first one. The results suggest that our graph-based algorithm is faster and only slightly less effective than the text-based, best-first algorithm, while significantly outperforming the breadth-first one.

    Mining Parameters That Characterize The Communities In Web-Like Networks

    Community mining in large, complex, real-life networks such as the World Wide Web has emerged as a key data mining problem with important applications. In recent years, several graph-theoretic definitions of community, generally motivated by empirical observations and intuitive arguments, have been put forward. However, a formal evaluation of the appropriateness of such definitions has been lacking. We present a new framework developed to address this issue, and then discuss a particular implementation of this framework. Finally, we present a set of experiments aimed at evaluating the effectiveness of two specific graph-theoretic structures, alliance and near-clique, in capturing the essential properties of communities. © 2006 IEEE
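
The abstract does not define the alliance structure it evaluates. As a hedged illustration, the Python sketch below assumes the usual graph-theoretic definition of a defensive alliance: a non-empty node set in which every member has at least as many neighbors inside the set, counting itself, as outside it. The function name `is_defensive_alliance` and the toy graph are hypothetical.

```python
def is_defensive_alliance(adj, group):
    """Check the (assumed) defensive-alliance condition for a node set.

    Every node v in the group must satisfy
        (neighbors of v inside the group) + 1 >= (neighbors of v outside it),
    where the +1 counts v itself.
    """
    group = set(group)
    if not group:
        return False
    for v in group:
        inside = len(adj[v] & group)
        outside = len(adj[v] - group)
        if inside + 1 < outside:
            return False
    return True


# Toy graph (assumed): a 4-clique {0, 1, 2, 3} with a path 3-4-5-6 attached.
adj = {
    0: {1, 2, 3},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {0, 1, 2, 4},
    4: {3, 5},
    5: {4, 6},
    6: {5},
}
print(is_defensive_alliance(adj, {0, 1, 2, 3}))  # -> True
print(is_defensive_alliance(adj, {4}))           # -> False
```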

    A Birth-Death Dynamic Model Of Scale-Free Networks

    We study a dynamic model of scale-free networks which incorporates not only the birth of vertices and edges but also their death. We analyze the degree distribution of this model by employing a mean-field approach and numerical simulations. Copyright 2005 ACM

    Techniques For Analyzing Dynamic Random Graph Models Of Web-Like Networks: An Overview

    Various random graph models have recently been proposed to replicate and explain the topology of large, complex, real-life networks such as the World Wide Web and the Internet. These models are surveyed in this article. Our focus has primarily been on dynamic random graph models that attempt to account for the observed statistical properties of web-like networks through certain dynamic processes guided by simple stochastic rules. Particular attention is paid to the equivalence between mathematical definitions of dynamic random graphs in terms of inductively defined probability spaces and algorithmic definitions of such models in terms of recursive procedures. Several techniques that have been employed for studying dynamic random graphs, both heuristic and analytic, are expounded. Each technique is illustrated through its application in analyzing various graph parameters, such as degree distribution, degree correlation between adjacent nodes, clustering coefficient, distribution of node-pair distances, and connected-component size. A discussion of the most recent salient work and a comprehensive list of references in this rapidly expanding area are included. © 2007 Wiley Periodicals, Inc.

    Preferential Deletion In Dynamic Models Of Web-Like Networks

    In this paper a discrete-time dynamic random graph process is studied that interleaves the birth of nodes and edges with the death of nodes. In this model, at each time step either a new node is added or an existing node is deleted. A node is added with probability p together with an edge incident on it. The node at the other end of this new edge is chosen based on a linear preferential attachment rule. A node (and all the edges incident on it) is deleted with probability q = 1 - p. The node to be deleted is chosen based on a probability distribution that favors small-degree nodes, in view of recent empirical findings. We analyze the degree distribution of this model and find that the expected fraction of nodes with degree k in the graph generated by this process decreases asymptotically as $k^{-1 - 2p/(2p - 1)}$. © 2007 Elsevier B.V. All rights reserved
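
The birth-death process in this abstract can be simulated with a short Python sketch. The abstract does not give the deletion distribution, so the weight used here (number of nodes minus the node's degree) is an assumption chosen only because it favors small-degree nodes. The uniform fallback when all degrees are zero, the guard that keeps at least two nodes alive, and the names `simulate` and `predicted_exponent` are likewise illustrative. The helper evaluates the decay exponent 1 + 2p/(2p - 1) as read from the abstract, which is meaningful only for p > 1/2.

```python
import random
from collections import Counter


def predicted_exponent(p):
    """Magnitude of the power-law exponent, 1 + 2p/(2p - 1), for p > 1/2."""
    if p <= 0.5:
        raise ValueError("the expression requires p > 1/2")
    return 1 + 2 * p / (2 * p - 1)


def simulate(steps, p, seed=0):
    """Run the add/delete process and return the adjacency dict."""
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}  # assumed starting graph: a single edge
    next_id = 2
    for _ in range(steps):
        nodes = list(adj)
        # Guard (assumption): force a birth while only two nodes remain.
        if rng.random() < p or len(nodes) <= 2:
            # Birth: attach the new node by linear preferential attachment.
            weights = [len(adj[v]) for v in nodes]
            if sum(weights) == 0:
                target = rng.choice(nodes)  # fallback: all degrees are zero
            else:
                target = rng.choices(nodes, weights=weights, k=1)[0]
            adj[next_id] = {target}
            adj[target].add(next_id)
            next_id += 1
        else:
            # Death: assumed weight n - degree(v) favors small-degree nodes.
            n = len(nodes)
            weights = [n - len(adj[v]) for v in nodes]
            victim = rng.choices(nodes, weights=weights, k=1)[0]
            for u in adj.pop(victim):
                adj[u].discard(victim)
    return adj


graph = simulate(steps=2000, p=0.75)
degree_counts = Counter(len(nbrs) for nbrs in graph.values())
print(predicted_exponent(0.75))  # -> 4.0
print(sorted(degree_counts.items())[:5])
```

Comparing the empirical degree counts against the predicted exponent would need far more steps and a proper fit, so the printout is only a sanity check.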